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Abstract 

Scene  flow  methods  estimate  the  three-dimensional  motion  field  for  points  in  the  world, 
using  multi-camera  video  data.  Such  methods  combine  multi-view  reconstruction  with  mo¬ 
tion  estimation.  This  paper  describes  an  alternative  formulation  for  dense  scene  flow  es¬ 
timation  that  provides  reliable  results  using  only  two  cameras  by  fusing  stereo  and  opti¬ 
cal  flow  estimation  into  a  single  coherent  framework.  Internally,  the  proposed  algorithm 
generates  probability  distributions  for  optical  flow  and  disparity.  Taking  into  account  the 
uncertainty  in  the  intermediate  stages  allows  for  more  reliable  estimation  of  the  3D  scene 
flow  than  previous  methods  allow.  To  handle  the  aperture  problems  inherent  in  the  estima¬ 
tion  of  optical  flow  and  disparity,  a  multi-scale  method  along  with  a  novel  region-based 
technique  is  used  within  a  regularized  solution.  This  combined  approach  both  preserves 
discontinuities  and  prevents  over-regularization  -  two  problems  commonly  associated  with 
the  basic  multi-scale  approaches.  Experiments  with  synthetic  and  real  test  data  demonstrate 
the  strength  of  the  proposed  approach. 
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1  Introduction 


There  is  increasing  interest  in  methods  that  can  estimate  the  motion  of  a  3D  scene 
given  video  streams  obtained  via  a  multi-camera  rig.  The  resulting  estimates  of 
3D  scene  flow  can  be  used  in  a  wide  variety  of  applications  including  robotics, 
autonomous  navigation,  automated  model  reconstruction  and  reverse  engineering, 
human  motion  and  gesture  analysis,  virtual  reality  and  movie  special  effects,  video 
compression  and  retrieval,  etc. 
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While  the  demonstrated  applications  of  non-rigid  3D  scene  flow  estimation  are 
impressive,  some  aspects  of  the  3D  motion  estimation  problem  remain  open.  In 
particular,  since  the  estimation  of  3D  motion  relies  on  the  information  available 
from  2D  image  data,  the  estimation  of  3D  motion  is  generally  susceptible  to  image 
noise.  In  order  to  obtain  a  reliable  estimate  of  3D  motion,  usually  a  large  number  of 
cameras  are  used  in  the  stereo  rig.  There  are  also  problems  with  estimation  errors 
that  arise  in  the  regions  of  low  contrast  variation,  or  in  areas  of  the  surface  which 
are  visible  to  only  a  subset  of  the  cameras.  Ideally,  one  would  prefer  an  estimation 
algorithm  that  can  handle  these  problems  in  a  principled  way. 

One  of  the  main  contributions  of  the  paper  is  a  method  that  can  quantify  and  account 
for  errors  in  non-rigid  3D  scene  flow  estimates  that  arise  due  to  image  measurement 
errors.  This  is  in  direct  contrast  with  past  scene  flow  methods  [51,  59,  60];  image 
measurement  errors  propagate  through  the  2D  optical  flow  and  disparity  estimates, 
and  ignoring  this  can  lead  to  poor  3D  scene  flow  estimates.  Our  method  explicitly 
models  the  uncertainty  of  2D  optical  flow  and  disparity  estimation,  and  then  sys¬ 
tematically  accounts  for  this  uncertainty  information  in  the  estimation  of  3D  scene 
flow.  Our  experiments  on  data  with  known  ground  truth  show  that  by  incorporat¬ 
ing  the  2D  uncertainties,  the  angular  and  the  magnitude  errors  of  the  estimated  3D 
scene  flow  are  at  least  one  standard  deviation  smaller  than  those  obtained  with  [51]. 

Another  main  contribution  of  the  paper  is  a  unified  region-based,  multi-scale  al¬ 
gorithm  for  the  estimation  of  optical  flow  and  disparity.  Compared  to  [43],  the 
proposed  multi-scale,  region-based  algorithm  preserves  the  discontinuities,  while 
at  the  same  time,  the  optical  flow  and  disparity  within  the  same  region  are  regu¬ 
larized  through  a  parametric  model  fitting  process.  In  contrast  with  [6],  our  region 
model  fitting  process  makes  direct  use  of  the  2D  uncertainty  information  and  does 
not  require  setting  parameters  for  a  robust  error  norm.  Moreover,  our  algorithm 
is  capable  of  filling  in  the  estimates  of  optical  flow  and  disparity  for  regions  that 
are  only  visible  in  one  image  by  making  use  of  adjacent  regions  that  have  valid 
estimates  of  optical  flow  and  disparity.  Using  our  formulation,  we  demonstrate  im¬ 
proved  accuracy  of  optical  flow  and  disparity  using  standard  data  sets  [49,  50]. 

In  this  paper,  we  focus  on  the  challenging  case  of  estimating  dense  3D  scene  flow 
given  only  the  minimal  setup  of  two  cameras  in  the  stereo  rig.  Past  approaches  tend 
to  require  more  cameras  to  gain  a  reasonable  estimate  of  3D  scene  flow;  for  in¬ 
stance,  [51]  reported  experimental  results  in  a  rig  of  51  cameras.  In  experiments, 
our  algorithm  outperforms  [51]  in  the  two-camera  setup  and  produces  qualitatively 
similar  results  as  those  of  [59,  60]  where  three  cameras  are  needed.  We  observe  that 
it  becomes  possible  to  perform  well  in  the  challenging  two-camera  case  if  covari¬ 
ances  in  2D  flow  and  disparity  estimation  are  explicitly  propagated,  and  disconti¬ 
nuities  are  adequately  modeled  in  a  multi-scale,  region-based  framework.  Finally,  it 
should  be  emphasized  that  our  algorithm  can  be  used  with  a  rig  that  includes  more 
than  two  cameras  without  any  modifications. 
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Fig.  1.  Overview 


The  overview  of  the  computational  steps  involved  in  estimating  multi-scale  3D 
scene  flow  and  its  error  covariance  is  shown  in  Fig.  1.  The  cameras  are  calibrated. 
The  captured  image  streams  are  synchronized  and  rectified.  At  each  image  pyramid 
level,  first  we  compute  the  distributions  of  optical  flow  and  disparity,  then  we  make 
use  of  regions  from  image  segmentation  to  regularize  the  2D  displacements  (i.e., 
optical  flow  and  disparity).  Details  of  the  region-based  approach  are  described  in 
Sections  3.1  and  3.2.  The  estimates  of  the  2D  displacements  and  their  associated 
error  covariance  are  then  used  in  a  weighted  least  squares  formulation  to  estimate 
3D  scene  flow  at  this  level.  The  2D  displacements  and  3D  scene  flow  are  then 
propagated  to  the  next  pyramid  level.  The  final  estimates  are  obtained  at  the  lowest 
pyramid  level.  The  combined  algorithm  is  described  in  Section  3.3.  The  integrated 
approach  allows  a  principled  way  to  propagate  2D  information  to  3D  within  a  multi¬ 
scale  framework. 
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2  Related  Work 


We  broadly  classify  the  related  work  into  two  categories.  The  first  category  includes 
methods  for  3D  motion  estimation  where  the  3D  point  correspondence  is  explicitly 
recovered  over  time.  The  second  category  includes  methods  dynamic  depth  map 
estimation  where  there  is  no  notion  of  3D  point  correspondence. 


2. 1  3D  Motion  Estimation 


There  has  been  a  fairly  large  amount  of  research  done  in  the  area  of  3D  motion 
estimation.  We  group  the  related  work  into  four  categories  based  on  the  setup  and 
assumptions  made. 


2.1.1  Rigid  Motion,  Monocular  Sequence 

Structure-from-motion  techniques  [10,  47,  61]  recover  relative  motion  together 
with  scene  structure  from  a  monocular  image  sequence.  The  scene  is  generally 
assumed  to  be  rigid  [47]  or  piecewise  rigid  [10,  61];  thus,  only  a  restricted  form  of 
non-rigid  motion  can  be  analyzed  via  these  techniques  [3]. 


2.1.2  Non-rigid  Motion,  Monocular  Sequence 

By  making  use  of  a  priori  knowledge,  or  by  directly  modeling  assumptions  about 
the  scene,  techniques  like  [15,  31,  35,  46,  48,  56]  can  estimate  non-rigid  motion 
from  a  monocular  image  sequence. 

In  [31,  35],  a  deformable  model  is  used  and  the  3D  motion  is  recovered  by  estimat¬ 
ing  the  parameters  to  deform  a  predefined  model. 

The  older  work  of  [48]  assumes  that  the  motion  minimizes  the  deviation  from  rigid 
body  motion.  Recently,  more  researchers  have  discovered  that  in  practice,  many 
non-rigid  objects  deform  with  certain  structure.  Their  shapes  can  be  regarded  as  a 
rigid  component  plus  a  weighted  combination  of  certain  shape  bases.  In  [15],  the 
shape  bases  are  first  obtained  by  applying  Principal  Component  Analysis  on  the 
shape  points  of  a  sparse  mesh  tracked  by  a  stereo  tracking  algorithm;  then  this  set 
of  shape  bases  is  used  for  online  model-based  monocular  facial  tracking.  In  [46], 
an  Expectation-Maximization  (EM)  method  is  proposed  to  simultaneously  estimate 
3D  shape  and  motion  for  each  time  frame.  This  method  learns  the  parameters  of  a 
Gaussian,  and  robustly  fills-in  missing  data  points  under  the  orthographic  projec¬ 
tion  model.  In  [56],  a  set  of  constraints  is  proposed  to  eliminate  the  ambiguity  when 
determining  the  shape  bases.  This  method  can  provide  a  closed-form  solution  un- 


4 


der  the  weak-perspective  projection  model.  Both  [46,  56]  are  batch  methods  since 
tracked  feature  points  from  multiple  video  frames  are  required.  In  all  three  methods 
[15,  46,  56],  only  sparse  3D  motion  can  be  recovered.  Furthermore,  the  assumption 
of  a  linear  combination  of  shape  bases  may  be  insufficient  to  capture  more  complex 
and  subtle  shape  variations. 


2.1.3  Motion  Stereo 

With  multiple  cameras,  stereo  and  2D  motion  information  can  be  combined  to  re¬ 
cover  the  3D  motion,  e.g.,  [25,  30,  41,  53,  57,  62].  Except  for  [25]  and  [30],  nearly 
all  techniques  in  this  category  assume  rigidity  of  the  scene  and/or  rigidity  of  the 
motion.  For  non-rigid  tracking,  [25]  uses  relaxation-based  algorithms  and  [30]  gen¬ 
eralizes  the  deformable  model-based  approach  of  [31].  The  first  approach  cannot 
provide  dense  3D  motion  while  the  latter  approach  needs  a  priori  knowledge  of  the 
scene,  i.e.,  the  deformable  model  for  the  object  to  be  tracked. 


2.1.4  Non-rigid  Motion,  Multi-view 

In  an  approach  closely  related  to  ours,  Vedula,  et  al.,  [51]  introduce  the  concept 
of  dense  scene  flow  as  the  3D  counterpart  of  optical  flow.  They  present  a  linear 
algorithm  to  compute  scene  flow  from  the  optical  flow  fields  of  the  video  streams 
from  multiple  cameras.  Given  scene  flow  and  initial  3D  scene  structure,  dynamic 
scene  structure  can  be  recovered.  Zhang  et  al.,  [59]  reformulate  the  scene  flow 
estimation  problem  in  terms  of  energy  minimization;  scene  flow  is  computed  by 
fitting  an  affine  motion  model  to  image  blocks  with  global  smoothness  constraints. 
This  algorithm  is  further  improved  in  [60]  so  that  discontinuities  are  preserved  and 
occlusions  are  handled.  However,  in  [60],  the  depth  information  in  the  occluded 
area  is  simply  ignored  although  it  has  been  computed;  the  3D  scene  flow  in  this 
case  is  simply  estimated  from  the  multiple  view  optical  flow  constraints.  In  [36], 
Pons,  et.ai,  avoid  the  explicit  computation  of  optical  flow.  They  back-project  the 
images  onto  the  instantaneous  3D  surface  of  the  scene,  and  the  depth  recovery  and 
3D  scene  flow  are  modeled  within  an  energy  minimization  framework.  The  costs 
used  in  the  energy  minimization  are  global  and  local  statistical  measures  on  the 
back-projected  images.  Occlusions  are  not  handled  in  this  approach.  As  is  the  case 
with  other  energy  minimization  approaches  in  this  category  [59,  60],  some  weights 
on  the  regularization  terms  must  be  determined  beforehand  [36]. 


2.2  Dynamic  Depth  Map 


Approaches  like  [11,58,  45]  recover  a  dynamic  depth  map  over  time  by  making  use 
of  motion  constraints,  but  do  not  output  3D  scene  flow.  In  [45],  the  scene  is  assumed 
to  be  piecewise  planar.  Motion  constraints  are  used  to  predict  the  depth  map  of  the 
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planar  patch  in  the  next  time  step.  Patches  are  merged  together  through  a  greedy 
hypothesis  testing  algorithm  that  minimizes  an  image  matching  cost.  Two  other 
approaches  in  this  category  [11,58]  recover  shape  from  dynamic  scenes  by  finding 
correspondence  in  a  3D  space  time  window.  Both  approaches  are  batch  methods 
as  multiple  frames  are  needed  for  constructing  the  space  time  window;  hence  these 
techniques  are  not  suitable  for  online  algorithms.  Furthermore,  [11,  58]  require 
active  illumination  to  improve  the  accuracy  in  correspondence  matching.  All  three 
systems  only  output  dynamic  depth  maps;  there  is  no  point-to-point  correspondence 
computed  over  time,  hence  there  is  no  3D  scene  flow  being  computed. 

Our  proposed  method  aims  to  compute  dense  3D  scene  flow  fields  in  a  multiple 
camera  setup  by  combining  simultaneous  2D  optical  flow  with  stereo-depth  infor¬ 
mation.  No  assumption  of  the  scene  or  3D  motion  is  made.  In  particular,  instead 
of  trying  to  avoid  or  eliminate  the  uncertainty  in  the  computation  of  3D  scene 
flow,  we  model,  incorporate  and  propagate  the  uncertainty  information  in  a  prin¬ 
cipled  and  consistent  way  when  we  solve  for  the  3D  scene  flow.  The  first  benefit 
of  incorporating  the  uncertainty  information  is  that  we  improve  the  accuracy  of 
Vedula’s  [51]  method.  Another  benefit  of  propagating  the  uncertainty  information 
is  that  the  propagated  3D  uncertainty  information  is  useful  for  applications  like  3D 
tracking  and  3D  motion  analysis.  The  proposed  algorithm  also  marries  a  global 
method  and  a  segmentation  based  local  method  in  a  multi-scale  framework  to  pre¬ 
serve  discontinuities  and  to  fill  in  information  when  the  areas  of  images  are  of  low 
texture.  Occlusions  are  also  handled  through  the  region  merging  process,  where 
occluded  regions  may  take  on  valid  estimates  from  one  of  the  adjacent  regions.  In 
experiments  with  synthetic  data,  we  show  that  by  incorporating  the  covariance  in¬ 
formation  from  optical  flow  and  disparity,  the  estimation  errors  measured  in  terms 
of  angle  and  magnitude  of  the  estimated  3D  scene  flow  are  at  least  one  standard 
deviation  smaller  than  those  obtained  via  [51].  In  experiments  with  real  data,  the 
results  we  obtain  with  a  two-camera  setup  are  comparable  to  the  results  presented 
in  [60]  where  three  cameras  are  used. 


3  Approach 


Both  optical  flow  and  disparity  estimation  can  be  formulated  as  problems  of  find¬ 
ing  corresponding  points  in  two  images.  While  optical  flow  estimation  finds  such 
correspondences  in  images  taken  at  different  times,  disparity  finds  them  in  images 
captured  by  cameras  in  different  views.  Surveys  of  related  techniques  in  [5,  39] 
show  that  although  researchers  have  worked  to  estimate  optical  flow  and  disparity 
separately  in  the  past,  the  techniques  used  to  solve  these  two  problems  are  simi¬ 
lar.  These  techniques  can  be  generally  categorized  as:  phase-based,  energy-based, 
feature -based  and  area-based  methods.  Phase-based  methods  [13,  21,  24,  54,  55] 
make  use  of  Fourier  phase  information.  Energy-based  methods  [1,  2,  19,  20,  28, 
33,  34,  37,  40,  43]  minimize  a  cost  function  plus  a  regularization  term  in  a  vari- 
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ational  framework  to  solve  for  the  2D  displacements  (optical  flow  and  disparity). 
Feature-based  methods  [4,  7,  16]  match  features  ( e.g .  points,  edges,  curves,  etc.) 
in  two  images.  Area  based  methods  [12,  14,  22,  23,  27]  find  optical  flow/disparity 
by  correlating  image  patches  across  images,  e.g.  correlation,  mutual  information, 
etc.  Thus,  closely-related  methods  have  been  developed  for  both  optical  flow  and 
disparity  computation. 

To  handle  the  discontinuities  when  estimating  flow  and  disparity,  there  are  methods 
that  make  use  of  color  image  segmentation,  or  a  combination  of  flow/disparity  and 
color,  and  local  region  model  fitting  when  computing  optical  flow  [6,  18,  32,  63] 
and  stereo  [45,  64].  None  of  these  methods  make  use  of  uncertainty  information  for 
the  flow/stereo  estimates. 

In  this  section,  we  first  give  a  general  formulation  of  the  optical  flow  and  dispar¬ 
ity  problem,  followed  by  the  description  of  a  previous  global  approach  by  [43] 
that  puts  the  optical  flow  estimation  problem  in  a  maximum  a  posterior  (MAP) 
framework.  The  global  approach  suffers  from  over-regularization,  especially  in  a 
multi-scale  approach  due  to  coarse-to-fine  information  propagation.  To  overcome 
this  problem,  we  introduce  a  local  motion  model  fitting  method  that  makes  use  of 
image  segmentation.  The  model  fitting  process  involves  the  use  of  a  weighted  least 
squares  method.  Hence  the  output  from  the  motion  fitting  process  is  just  the  linear 
transformation  of  the  MAP  estimates  from  [43].  The  region-based  approach  is  then 
generalized  to  address  the  problem  of  disparity  estimation. 

Let  I(x,  y.  a)  be  the  function  of  pixel  position  and  time/view  for  the  image  signal 
(a  —  t  for  time,  a  =  c  for  camera  view).  Let  v  be  the  2D  pixel  displacement  caused 
by  change  in  time  or  camera  view.  Commonly,  the  goal  is  to  find  v  such  that 

V/  ■  v  +  JQ  =  0.  (1) 


In  the  following  derivations,  we  use  /  to  represent  I(x,y,a )  for  simplicity.  V/ 
represents  the  2D  spatial  gradient  vector  of  the  image  and  Ia  represents  the  change 
in  image  caused  either  by  time  or  view.  The  same  equation  has  been  used  both 
in  the  context  of  optical  flow  [20]  and  stereo  vision  [29].  This  problem  is  under¬ 
constrained  as  we  are  given  a  single  linear  equation  for  solving  two  unknowns 
(|v|  =  2).  To  get  around  this,  some  form  of  regularization  is  usually  employed.  The 
common  formulation  is  to  minimize: 

£(v)  =  E(VVv  +  4*)2>  (2) 


where  i  is  the  index  of  the  pixel  in  a  predefined  neighborhood  O.  Minimizing 
Eq.  2  enforces  smoothness  of  2D  pixel  displacements  in  the  neighborhood  This 
smoothness  constraint  is  often  violated  when  there  is  a  large  displacement  of  the 
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pixels  in  the  neighborhood  17  between  the  two  images  captured.  To  alleviate  this 
problem,  multi-resolution  based  approaches  are  widely  adopted. 

Simoncelli  el  al.  proposed  an  approach  that  computes  distributions  of  optical  flow 
[43].  This  approach  computes  the  covariance  of  the  flow  estimates  at  each  pixel 
based  on  the  contrast  properties  over  a  local  neighborhood  at  multiple  scales.  In 
this  paper,  we  adapt  this  approach  in  an  improved  formulation.  Variance  estimates 
from  [42]  can  provide  us  with  the  knowledge  of  the  error  in  the  coarser  scale  es¬ 
timates  so  that  we  can  make  corrections  during  the  estimation  at  finer  scales.  Our 
proposed  approach  takes  care  of  the  over- smoothing  problem  of  [43]  and  still  pre¬ 
serves  the  useful  property  of  producing  an  estimate  of  the  flow  distribution  at  each 
pixel.  Our  approach  is  extended  to  estimate  disparity  distributions  in  stereo.  Given 
the  distributions  of  optical  flow  and  disparity,  we  compute  3D  scene  flow  via  an 
integrated  algorithm  using  weighted  least  squares,  as  described  in  Section  3.3.  As 
will  be  seen  in  the  experiments,  utilizing  flow  and  disparity  distribution  informa¬ 
tion  in  our  formulation  to  3D  scene  flow  estimation  yields  superior  results  to  [51] 
in  the  case  of  just  two  cameras.  Compared  to  another  region-based  approach  [60], 
the  proposed  algorithm  is  simple,  has  fewer  system  parameters  to  set,  and  yields 
good  estimation  results  in  the  experiments. 


3.1  Distributions  of  Optical  Flow 


Following  [43],  the  uncertainty  in  optical  flow  computation  is  described  through 
the  use  of  a  Gaussian  noise  model, 

V/  ■  (v  -  ni)  +  It  =  n2.  (3) 


The  image  intensity  signal  is  represented  as  a  function  I  of  position  (denoted 
by  image  coordinates  x  and  y)  and  time  (denoted  by  t).  The  image  gradient  is 
V/  =  (Ix(x,  y,  t).  Iy(x.  y,  t))T  and  the  temporal  derivative  of  the  image  is  It.  The 
first  random  variable  ~  J\f( 0,  A(),  describes  the  error  resulting  from  the  fail¬ 
ure  of  the  planarity  assumption  (i.e.,  v  is  constant  in  a  small  region).  The  second 
random  variable,  n2  ~  Af( 0,  A2)  describes  the  errors  in  the  temporal  derivative 
measurements. 


Based  on  the  constant  brightness  assumption  and  from  Eq.  3,  a  MAP  estimate  of  v 
can  be  derived.  Let  Ap  be  the  prior  distribution  of  v,  each  optical  flow  vector  (per 
pixel)  is  considered  as  a  normal  distribution  with  mean  flow  v  and  covariance  Av: 


Av  = 


y _ _ +  a-1 

2-anVI(xi,yut)\\2  +  ^  P 


i  -i 


(4) 
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(5) 


=  — Av-E 


9i^i 


r  °'il|V/(a;i,2/i,t)||2  +  <r| 


2  ’ 


where 


M  =  V/V/7  = 


X  42  44 
\44  4 


,  b  = 


IJt 


IJt 


Vv 


and  gi  is  the  weight  assigned  to  the  neighboring  pixel  i,  a\I  =  Ai  (Z  is  the  identity 
matrix)  and  o\  =  A2.  In  practice,  the  neighborhood  Q  often  refers  to  a  m  by  m 
window  and  a  2D  Gaussian  filter  g  of  the  same  size  is  applied  to  get  the  weighted 
sums  in  Eq.  4  and  Eq.  5  [43].  The  inclusion  of  Ap  makes  Eq.  4  well-conditioned. 


3.1.1  Coarse-to-Fine  Estimation  of  Flow  Distribution 

The  constant  brightness  assumption  made  in  deriving  Eq.  4  and  5  is  often  violated 
when  relatively  large  motion  (typically  >  2  pixels)  occurs,  to  alleviate  this  prob¬ 
lem  and  to  propagate  the  uncertainty  information  at  coarser  scale  levels  (lower- 
resolution  images)  to  finer  scale  levels  (higher-resolution  images),  Simoncelli  de¬ 
veloped  a  filter-based  coarse-to-fine  algorithm  [42].  Instead  of  the  traditional  ap¬ 
plication  of  Kalman  filtering  to  propagate  information  over  time,  [42]  propagates 
information  across  different  scales.  We  only  describe  the  basic  solution  here. 

First,  we  define  a  state  evolution  equation  for  the  estimated  flow  field  v, 

vl  =  Ei-V_1  +  n0,  n0  ~  A/"(0,  A0),  (6) 


where  l  is  an  index  for  scale  such  that  larger  values  of  l  correspond  to  higher  res¬ 
olution  level  in  the  image  pyramid.  E  is  a  linear  interpolation  operator  used  to 
extend  the  coarse  scale  flow  fields  to  the  finer  scale  flow  fields,  which  is  analogous 
to  the  state  evolution  matrix  in  the  standard  Kalman  filtering  framework.  The  ran¬ 
dom  variable  n0  represents  the  uncertainty  of  the  prediction  of  the  finer-scale  flow 
fields  from  the  coarser-scale  flow  fields;  it  is  assumed  to  be  point- wise  independent, 
zero-mean  and  normally-distributed. 

The  measurement  equation  is  defined  based  on  Eq.  3: 

-/'  =  V/,-v,  +  (n2  +  V/'-n1).  (7) 


Applying  the  standard  Kalman  filter  framework  (replacing  the  traditional  Kalman 
filter  time  index  t  with  scale  index  /),  given  Eq.  6  and  Eq.  7,  an  optimal  estimator 
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for  vl  is  derived  from  the  estimate  of  the  coarse  scale  1  and  a  set  of  fine  scale 
derivative  measurements: 


A'1  =  El~1A^1(El~1)T  +  A0  , 


“  (V/')T[A,i  +  Ai]V/i  +  A2  ’ 
ul  =  -i\  -  (vil)T'El~1vl~1 , 
vl  =  E,-1vl~1  +  Klvl  , 

=  An  —  Kl(VIl)TAn  .  (8) 

In  the  above  equations,  Kl  is  the  Kalman  gain  and  ul  is  the  process  innovation 
which  is  approximated  as  the  temporal  derivative  of  the  warped  images.  The  Kalman 
filter  is  directly  related  to  recursive  weighted  least  squares  [44].  In  our  case,  it  is 
used  to  propagate  covariance  information  from  the  previous  spatial  scale  to  the  cur¬ 
rent  spatial  scale.  A  justification  for  this  approximation  and  more  details  can  be 
found  in  [42]. 


3.1.2  Region-based  Parametric  Model  Fitting 

Simoncelli’s  approach  [43]  tends  to  over-smooth  the  solution  due  to:  (1)  uniform 
window  size  for  defining  a  neighborhood,  and  (2)  level  to  level  propagation  of 
information. 

One  solution  to  the  problem  of  uniform  window  size  is  to  use  window  sizes  that  are 
adaptive  to  local  image  properties.  Given  that  information  propagation  is  actually 
the  desirable  property  of  a  multi-scale  approach,  it  is  hard  to  address  the  over¬ 
smoothing  problem  caused  by  level-to-level  information  propagation.  To  solve  this 
problem,  we  take  inspiration  from  [6]  by  making  use  of  a  parametric  model  to  fit 
flow  within  the  regions  obtained  from  image  segmentation.  In  the  motion  estima¬ 
tion  literature  [6,  52],  it  is  commonly  assumed  that  motion  of  the  pixels  within 
the  same  region  can  be  fitted  to  a  parametric  model.  Following  the  conventions  in 
[6,  52],  we  give  a  brief  description  of  the  motion  model  fitting  process. 

For  each  pixel,  denoted  by  its  coordinates,  x,:  =  (a:,,  y,  ),  within  the  same  region, 
one  of  the  following  models  is  selected  by  the  algorithm  to  fit  flow  vectors: 


F(xi) 

F(xi) 


1  0 
0  1 


,  a  —  [do  «3]T) 

a  =  [q-q  cq  fl 2  ®4 


1  Xi  Hi  0  0  0 

0  0  0  1  Xi  yi 
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,  a.  —  [ciq  04  g<2  Gq  G7  G3  04  0,5]^. 


1  Xi  yt  x\  XiUi  0  0  0 
0  0  0  yf  1  Xi  yi 


The  two-parameter  model  corresponds  to  translation,  the  six-parameter  model  cor¬ 
responds  to  affine  motion,  and  the  eight-parameter  model  corresponds  to  quadratic 
motion.  The  specifications  of  these  models  follow  the  convention  of  [6,  52]. 

Minimizing  the  following  weighted  least  squares  equation  yields  an  estimate  of  the 
model  parameters  ar  for  region  r, 

r 

ar  =  argmin^(v  -  F(xi)ar)TA^1(v  -  F(xi)ar).  (9) 


Though  this  formulation  is  similar  in  spirit  to  that  of  [6],  the  robust  error  norm 
is  not  used  as  we  have  an  uncertainty  model  for  v  from  Eq.  4  and  Eq.  5,  which 
explicitly  models  the  local  image  intensity  information.  Pixels  in  the  region  with 
reliable  estimates  of  v  naturally  carry  more  weight  in  the  fitting  process.  These 
pixels  correspond  to  edge  pixels  or  the  regions  with  rich  texture.  Hence  the  fitting 
tends  to  be  more  robust.  In  this  combined  approach,  the  cost  function  of  Eq.  9  is  still 
convex  and  guaranteed  to  have  an  optimal  solution  given  enough  pixels  within  the 
region.  Let  a  be  the  optimal  solution,  the  updated  flow  field  v'  and  corresponding 
covariance  A'v  are  computed  as  follows: 


v/  =  F(xi)a, 

Aa=  (J(xj)TA“1J(xi))_1, 

A^,  =  F(xj)AaF(x,;)T,  (10) 

where  J(x()  is  the  Jacobian  of  F  (  x,  )  evaluated  at  x,. 

In  the  combined  approach,  first,  image  segmentation  based  on  color/intensity  infor¬ 
mation  is  performed  at  each  resolution  level  of  the  image  pyramid.  In  our  imple¬ 
mentation,  segmentation  is  obtained  via  mean  shift  [9].  The  order  of  the  parametric 
model  used  for  fitting  follows  the  same  set  of  rules  as  defined  in  [6].  Hence,  the 
order  of  the  parametric  model  is  adaptive  to  the  resolution  level,  region  size  and 
fitting  residual  error.  A  lower-order  model  is  always  preferred  if  a  higher-order 
model  fails  to  reduce  the  error  residual.  When  the  residual  error  of  fitting  an  eight- 
parameter  model  is  still  high  and  the  region  size  is  large,  the  region  is  split  by  using 
mean  shift  [9]  on  the  region  flow  field  as  color/intensity  information  alone  is  not 
enough.  Model  fitting  is  then  performed  on  the  newly  split  regions.  This  step  can 
be  recursive;  the  stopping  criteria  is  either  the  region  is  small  enough  or  the  error 
residual  is  below  a  specified  threshold.  Figure  3.1.2  shows  the  process  of  com¬ 
puting  optical  flow  and  the  resulting  flow  field  on  the  Yosemite  sequence  [17].  In 
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Fig.  2.  Example  of  flow  computation  for  Yosemite  sequence  [17], 


Table  1,  the  angular  error  indicates  that  by  using  the  combined  approach,  we  are 
able  to  reduce  the  angular  error  in  terms  of  mean  and  standard  deviation  compared 
to  [6,  42].  Our  method  produced  error  close  to  the  state  of  the  art  method  [34],  but  it 
is  pointed  out  later  in  [8]  that  the  ground  truth  delivered  with  the  Yosemite  sequence 
has  problems,  making  it  difficult  to  assess  the  significance  of  slight  difference  of 
error  reported  for  this  sequence.  Thus  we  consider  the  performance  respectable, 
while  also  suitable  for  use  in  our  unified  framework  for  disparity  and  3D  scene 
flow  estimation. 
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Method 

Average  Angular  Error  (AAE) 

Standard  Deviation  of  AAE 

Simoncelli,  et  al.  [42] 

3.81° 

7.09° 

Black-Jepson  [6] 

2.29° 

2.25° 

Our  Method 

1.17° 

2.08° 

Brox, et  al.  [34] 

0.99° 

1.12° 

Table  1 

Evaluation  results  of  the  proposed  approach  on  the  Yosemite  sequence.  Flow  computed  at 
the  cloudy  sky  area  is  not  used  for  error  computation. 

3.2  Distributions  of  Disparity 


Similar  region  model  fitting  algorithms  [26,  45]  have  also  been  used  for  depth  map 
computation  so  that  sharp  boundaries  can  be  preserved.  Hence  with  slight  modifi¬ 
cation,  the  same  integrated  algorithm  for  optical  flow  computation  can  be  used  to 
compute  disparity  for  input  image  pairs  captured  by  a  stereo  rig.  To  do  this,  we 
just  substitute  the  time  index  t  and  t  +  1  with  the  view  index  L  and  M,  where  L 
refers  to  left  view  and  R  refers  to  the  right  view  in  a  binocular  stereo  rig.  Only 
horizontal  displacement  and  corresponding  variances  are  computed.  Hence  the  es¬ 
timation  of  disparity  can  be  solved  in  the  same  way  as  optical  flow.  By  using  the 
method  to  estimate  optical  flow  and  disparity,  we  are  able  to  combine  them  to¬ 
gether  in  a  consistent  way  for  3D  scene  flow  estimation.  Fig.  3  shows  the  pro¬ 
cess  of  computing  disparity  for  the  standard  Teddy  data  set  [50].  We  have  tested 
the  performance  of  our  algorithm  using  the  evaluation  tool  and  data  set  provided 
at  http  :  / /www .  middlebury .  edu/ stereo,  our  algorithm  on  average  ranks 
among  the  top  ten  on  the  data  set  provided  as  shown  in  Table  2. 

One  of  the  evaluation  parameters  provided  by  the  online  evaluation  system  [50] 
is  “E.T.”,  which  stands  for  error  threshold.  It  is  the  acceptable  disparity  error.  If 
E.T.  =  1,  then  the  estimated  disparity  is  considered  correct  if  the  difference  be¬ 
tween  the  ground  truth  disparity  and  the  estimated  disparity  is  within  1  pixel,  oth¬ 
erwise,  it  is  considered  wrong.  The  evaluation  system  allows  us  to  adjust  the  value 
from  0.5  —  2.0  pixels.  As  shown  in  Table  2,  our  method  ranks  higher  when  the 
acceptable  disparity  threshold  is  larger.  We  believe  this  is  due  to  the  smoothing  ef¬ 
fect  of  the  derivative  filters  used  in  our  2D  displacement  computation.  The  results 
obtained  by  our  method  on  all  four  data  sets  are  shown  in  Fig.  4. 

A  single  consistent  algorithm  for  computing  optical  flow  and  disparity  is  summa¬ 
rized  in  Algorithm  1.  We  use  v  in  Algorithm  1  to  represent  optical  flow  and  dispar¬ 
ity  as  they  are  both  2D  pixel  displacements  between  two  images  I\  and  J2. 

The  proposed  unified  algorithm  presented  in  Alg.  1  is  a  two-stage  algorithm.  In  the 
first  stage,  as  with  [42],  the  estimation  of  2D  displacements  exploits  the  constant 
brightness  assumption.  Hence  the  proposed  algorithm  may  have  problems  when 
estimating  2D  correspondences  for  objects  in  the  scene  that  have  non-Lambertian 
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left  image  right  image 


resulting  disparity 


Fig.  3.  Example  of  disparity  computation  for  Teddy  data  set  [50]. 

surface  reflectance  properties,  or  in  the  case  of  large  baseline  stereo.  The  second 
stage  region  model  fitting  process  used  in  our  algorithm  helps  to  alleviate  the  large 
baseline  problem.  In  experiments  with  real  video  data  captured  with  relatively  large 
baseline  stereo  cameras  (the  observed  largest  displacement  is  around  100  pixels 
between  a  stereo  pair),  the  proposed  algorithm  is  still  able  to  produce  reasonable 
estimates  due  to  the  region  merging  step  where  the  information  fill-in  takes  place. 

In  our  experience,  our  algorithm  gives  a  good  estimate  of  optical  flow  and  dispar- 
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E.T. 

A.R. 

Tsukuba 

Venus 

Teddy 

Cones 

nocc 

all 

disc 

nocc 

all 

disc 

nocc 

all 

disc 

nocc 

all 

disc 

0.5 

10.3 

21.5 

22.5 

23.7 

5.81 

6.12 

9.73 

16.6 

21.7 

33.7 

11.3 

17.5 

18.8 

0.75 

9.4 

21.0 

21.9 

21.3 

0.45 

0.68 

3.11 

10.2 

14.9 

24.3 

5.79 

12.0 

12.7 

1.0 

7.8 

1.64 

2.65 

8.75 

0.13 

0.30 

1.86 

7.59 

11.7 

19.3 

4.74 

10.7 

10.8 

1.50 

5.8 

1.02 

1.83 

5.40 

0.13 

0.30 

1.83 

4.48 

7.12 

13.3 

3.76 

9.21 

9.25 

2.0 

5.4 

0.69 

1.40 

3.74 

0.12 

0.27 

1.72 

3.15 

4.91 

10.0 

3.24 

8.37 

8.32 

Table  2 

Evaluation  results  of  the  proposed  approach  on  the  stereo  data  set  [50]  using  the  evaluation 
measures  of  [38  ]  as  of  August  28,  2006.  The  numbers  represent  the  percentage  of  bad  pixels 
( i.  e. ,  pixel  whose  absolute  disparity  error  is  greater  than  “E.T.”).  “E.T.”  stands  for  Error 
Threshold,  which  is  the  acceptable  disparity  error.  “A.R.”  indicates  the  Average  Ranking 
of  our  algorithm  against  36  other  algorithms  listed  in  [50].  The  errors  reported  in  column 
“nocc”  are  the  errors  only  evaluated  in  the  non-occluded  areas  in  the  image;  the  errors 
reported  in  column  “all”  are  the  errors  evaluated  in  the  whole  image  excluding  the  border 
regions;  and  the  errors  reported  in  “disc  ”  are  the  errors  evaluated  in  the  regions  near  depth 
discontinuities  ( including  both  neighborhoods  of  depth  discontinuities  and  half-occluded 
regions). 

Algorithm  1  Unified  Algorithm  for  Computing  Optical  Flow  and  Disparity 
1:  construct  image  pyramids  of  L  levels  using  the  images  ii  and  J2. 

2:  initialize  Ap  and  A0  used  in  Eqs.  4  and  6. 

3:  for  l  =  0  to  L  —  1  of  the  pyramids  built  do 
4:  if  l  ==  0  then 

5:  compute  vz,  [Eqs. 4  and  5], 

6:  else 

7:  compute  vz,  Alv  [Eq.8]. 

8:  end  if 

9:  segment  image  at  current  level  of  the  pyramid  built  from  I\,  e.g.,  via  mean 

shift  [9], 

10:  for  each  region  from  the  segmentation  do 

11:  fit  a  parametric  model  to  the  vl  of  the  current  region  according  to  the 

process  described  in  [Section  3.1.2],  the  simplest  translation  model  is  al¬ 
ways  used  first. 

12:  compute  v\  Aj,  [Eq.  10]  for  pixels  in  that  region. 

13:  end  for 

14:  set  =  v'  and  A^  =  A^. 

15:  end  for 

ity.  However,  one  could  argue  that  we  should  just  use  the  most  accurate  algorithms 
available  to  solve  optical  flow  and  disparity,  and  then  combine  them  together  to 
solve  for  3D  scene  flow.  The  key  point  here  is  that  the  inaccuracies  in  estimating 
optical  flow  fields  and  disparities  are  inevitable.  It  is  often  more  desirable  to  explic¬ 
itly  model  the  inaccuracies  of  optical  flow  and  disparity  based  image  properties. 
Once  we  gain  estimates  of  both  the  distributions  of  optical  flow  and  disparity,  we 
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left  image 


disparities 


bad  pixels 

(absolute  disparity  error  >1.0) 


signed  disparity  error 


Fig.  4.  Results  on  all  4  data  sets  [50],  The  first  column  shows  the  left  image  of  the  in¬ 
put  stereo  pair.  The  output  disparity  maps  are  displayed  in  the  second  column.  The  error 
threshold  is  1.0,  hence  pixels  have  absolute  disparity  error  >1.0  pixels  are  labeled  as  bad 
pixels  in  the  third  column.  The  signed  disparity  error  maps  are  displayed  in  the  last  column. 

can  use  and  propagate  this  uncertainty  information  in  the  computation  of  the  3D 
scene  flow.  The  end  result  of  this  principled  way  of  modeling  and  propagating  2D 
uncertainty  information  is  that  we  obtain  better  estimates  of  3D  scene  flow  together 
with  the  propagated  error  covariance  information  of  the  3D  estimates. 


3.3  Computing  3D  scene  flow 


Given  the  unified  formulation  for  optical  flow  and  disparity,  we  now  formulate  the 
computation  of  3D  scene  flow.  We  will  assume  that  the  cameras  are  fully-calibrated 
and  do  not  change  in  the  experiments.  Following  [51],  scene  flow  is  defined  as  the 
3D  motion  field  of  the  points  in  the  world,  just  as  optical  flow  is  the  2D  motion 
field  of  the  points  in  an  image.  Optical  flow  is  simply  the  projection  of  the  scene 
flow  onto  the  image  plane  of  a  camera. 
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Given  a  3D  point  X  =  (. X ,  Y,Z),  the  2D  image  of  this  point  in  view  c  is  denoted 
as  xc  =  (xc,  yc).  The  2D  components  of  xc  are 


\PC]1(X,Y,Z,1)T  [Pc]2(X,Y,Z,1)t 

Xc  \PcMX,Y,Z,l)T'  Vc  \Pc]3{X,Y,Z,1)t* 

where  [Pc]  j  is  the  jth  row  of  the  projection  matrix  Pc.  If  the  camera  is  not  moving, 
then  the  2D  motion  v  =  is  uniquely  determined  by  the  following: 


dxc  d xc  dX 
dt  <9X  dt  ’ 


(12) 


To  solve  for  the  scene  flow  V  =  two  equations  are  needed.  Hence  at  least  two 
cameras  are  needed.  The  setup  of  the  system  of  equations  is  simply 

BV  =  U,  (13) 


where 
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az 
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L  ax 

dY 

az  J 

L  at.  J 

(14) 


A  singular  value  decomposition  of  B  gives  the  solution  that  minimizes  the  sum  of 
least  squares  of  the  error  obtained  by  re-projecting  the  scene  flow  onto  each  of  the 
optical  flows. 


3.4  Integrated  Approach 


As  discussed  in  Section  2,  it  is  known  that  the  2D  image  correspondence  problem 
(across  different  views  or  across  different  time  frames)  is  ill-posed.  Hence  it  is 
difficult  to  estimate  scene  flow  reliably  from  optical  flow  and  disparity.  One  way  to 
get  around  this  is  to  use  many  cameras,  as  reported  in  [51],  where  51  cameras  were 
used  to  solve  Eq.  13  reliably. 
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Rather  than  aiming  to  improve  the  accuracy  by  using  more  cameras,  we  propose  to 
incorporate  the  covariances  derived  from  the  computation  of  optical  flow  and  dis¬ 
parity.  By  taking  the  covariances  from  disparity  and  optical  flow  into  account,  the 
linear  system  of  Eq.  13  tends  to  produce  reasonable  scene  flow  even  when  given 
only  a  small  number  of  cameras.  Furthermore,  the  estimated  scene  flow  with  co- 
variances  can  be  used  for  applications  like  probabilistic  3D  tracking  and  3D  motion 
and  structure  analysis. 

For  a  stereo  pair,  the  3D  coordinate  X  is  related  to  the  disparity  d  and  corresponding 
image  coordinates  xp  and  xp  where  L  indicates  left  view  and  R  indicates  right  view. 
Fet  T  denote  the  baseline  and  /  denote  the  focal  length  (both  cameras  are  assumed 
to  have  the  same  focal  length).  The  following  equation  defines  the  relationship 
between  the  3D  coordinates,  2D  image  coordinates  in  the  left  and  right  cameras, 
and  the  pixel  disparity  between  left  and  right  cameras. 


T(x l  +  ^r)  y  _T(yL  +  ys)  _  fT 
2d  2d  d  ' 


Hence  we  solve  Eq.  14  for  scene  flow,  ¥  by: 


V  =  argmin(B¥  -  U)TW_1(B¥  -  U). 


(15) 


(16) 


In  the  binocular  setup,  W  is  derived  from  the  2D  covariances  from  the  disparity 
d  where  Ad  =  diag(ad,  erf),  and  the  2D  flow  field  v  where  Av  =  diag(crlx ,  <rj  ). 
Assuming  independence  of  the  estimates  of  optical  flow  and  disparity,  then 

W  =  A,AV.  (17) 

By  covariance  propagation,  the  error  covariance  of  scene  flow  ¥  is: 

Av  =  (BTW-1B)-1  =  B-1WB-t.  (18) 


Algorithm  2  describes  the  single  integrated  method  for  computing  optical  flow, 
disparity  and  3D  scene  flow.  To  compute  the  scene  flow  for  two  consecutive  frames 
in  the  stereo  video  streams,  we  use  / (L)  to  denote  the  left  video  stream  and  / (R) 
to  denote  the  right  video  stream.  First  we  build  image  pyramids  of  height  L  for 
I(t,  L),  lit  +  1,L),  I(t,  R)  and  I(t  +  1,R).  Pyramid  images  are  indexed  by  /, 
where  l  =  0  is  the  index  for  image  at  the  lowest  resolution  level  and  l  —  L  —  1 
is  the  index  for  image  at  the  highest  resolution  level.  Hence  Il(t,  L)  refers  to  the 
pyramid  image  at  level  l  and  the  image  that  is  used  to  construct  the  pyramid  is 
captured  by  left  camera  at  time  t.  The  optical  flow  fields  computed  at  each  level  of 
the  pyramid  for  the  binocular  views  are  denoted  as  v*(L)  and  v;(R).  Disparity  is 
denoted  as  dl. 
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Algorithm  2  Algorithm  for  computing  3D  scene  flow 
1:  construct  image  pyramids  of  L  levels  using  the  images  lit,  L),  lit  + 
1,L),  I(t,  R)  and  7(t  +  1,  R). 

2:  initialize  A;,  and  A0  used  in  Eqs.  3  and  5. 

3:  for  l  =  0  to  L  —  1  do 
4:  if  l  ==  0  then  ^ 

5:  computeA^L),  v(L)  ,  ALK),  v(R)  ,  A^  and  dl  using  Eqs. 4  and  5, 

6:  else  ^ 

7:  compute  A^(L),  v(L)  ,  A^(R),  v(R)  ,  A^  and  c?  using  Eq.8, 

8:  end  if 

9:  segment  Il(t,  L)  and  Il(t,  R)  via  mean  shift[9], 

10:  for  each  region  of  Il(t,  L)  from  the  segmentation  do 

11:  fit  a  parametric  model  to  the  flow  computed  between  Il(t,  L)  and 

l\t  + 1,  L)  and  disparity  of  Il(t,  L)  and  Il(t,  M)  of  the  pixels  in  the  region 
[Section  4.1.2], 

12:  compute  v;(L),  A^(L),  d'  and  A^  [Eq.  10]  for  every  pixel  in  that  region, 

13:  end  for 

14:  for  each  region  of  Il(t,  M)  from  the  segmentation  do 

15:  fit  a  parametric  model  to  the  flow  computed  between  I1  (t,  M)  and  I1  ( t  + 

1,  R)  of  the  pixels  in  the  region  [Section  4.1.2], 

16:  compute  v'(R),  A^(R)',  d'  and  A'd  [Eq.  10]  for  pixels  in  that  region, 

17:  end  for 

18:  set  v'(L)  =  v'(L),  vz(R)  =  v'(R).  A^(L)  =  AV/(L),  A^(R)  =  AV/(K),  dl  =  d’ 

and  Ald  =  A'd, 

19:  if  l  ==  0  then 

20:  solve  ¥' [Eq.  16], 

21:  else 

22:  solve  Yl  [Eq.  16],  using  ¥;_1  as  the  initial  estimate, 

23:  end  if 

24:  end  for 


4  Experiments 


Two  sets  of  experiments  are  conducted  to  demonstrate  the  effectiveness  of  the 
weighted  least  squares  method  and  the  performance  of  the  algorithm. 


4.1  Synthetic  3D  Data 


To  show  the  effectiveness  of  the  weighted  least  squares  method,  3600  3D  points  on 
a  planar  surface  with  known  3D  scene  flow,  2D  optical  flow  and  disparity  are  gener¬ 
ated.  The  motion  of  the  points  on  the  surface  follow  a  deforming  Gaussian  surface. 
Hence,  each  point  moves  in  slightly  different  direction  with  different  magnitude, 
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which  corresponds  to  a  non-rigid  motion.  Gaussian  noise  with  different  variances 
is  added  to  the  2D  optical  flow  and  disparity.  The  magnitudes  of  the  noise  vari¬ 
ances  range  from  2%  to  10%  of  the  average  displacements  (i.e.,  2D  optical  flow 
and  disparity).  Three  methods  are  tested  with  and  without  propagating  the  noise  es¬ 
timated  while  computing  2D  optical  flow  and  disparity.  Accuracy  of  the  computed 
3D  scene  flow  is  measured  using  the  average  angular  error  and  average  magnitude 
between  computed  3D  scene  flow  and  known  3D  motion.  The  mean  and  standard 
deviation  of  the  angular  and  magnitude  error  of  the  estimated  3D  scene  flow  are 
reported  based  on  the  average  of  10  runs  of  the  experiments  at  each  noise  level. 

Method  1:  Eq.  13  without  incorporating  covariance  [51]. 

Method  2:  Eq.  16  where  only  the  covariance  of  2D  optical  flow  is  used. 

Method  3:  Eq.  16  where  both  the  covariance  of  2D  optical  flow  and  the  variance 
disparity  are  used. 

Fig.  5  shows  the  mean  and  standard  deviation  of  angular  and  magnitude  error  and 
Fig.  6  shows  the  results  of  sample  frames  based  on  the  recovered  3D  scene  flow. 
The  insight  we  gain  from  these  experimental  results  is  that  taking  noise  into  con¬ 
sideration  yields  more  reliable  3D  scene  flow  estimates.  When  estimating  3D  scene 
flow  with  real  image  data,  the  computations  of  optical  flow  and  disparity  are  always 
inaccurate  due  to  the  camera  noise  and  the  image  properties,  e.g.,  image  regions 
with  no  texture  or  repetitive  texture,  image  regions  with  low  contrast  and  motion 
blur  when  we  capture  image  sequences,  etc.  Hence,  even  when  equipped  with  the 
most  accurate  optical  flow  and  disparity  algorithms,  the  2D  image  quantities  still 
cannot  always  be  evaluated  accurately.  However,  by  carefully  choosing  the  right 
algorithms  to  account  for  these  errors  and  taking  them  into  consideration  when  es¬ 
timating  3D  scene  flow,  we  can  improve  the  accuracy  of  estimates.  The  importance 
and  effectiveness  of  Algorithm  2  are  demonstrated  with  real  video  sequences  in  the 
next  experiment. 


4.2  Real  Videos 


To  evaluate  the  algorithm  in  practical  applications,  experiments  have  been  con¬ 
ducted  with  videos  of  real  scene  sequences.  In  all  the  experiments  conducted, 
<j i  =  0.008  (o'!  is  related  to  the  error  from  the  failure  of  the  assumption  that  the 
displacements  are  constant  in  a  small  region),  a-2  =  0.0  (cr 2  is  related  to  error  when 
computing  temporal  derivatives)  and  a0  =  0.10  (<j0  is  diagonal  entry  in  the  covari¬ 
ance  matrix  during  information  propagation)  and  initial  crp  =  0.5  (ap  represents  the 
prior  distribution  information  of  the  displacements).  Five-level  image  pyramids  are 
used  in  all  the  test  cases.  These  parameters  are  determined  during  experiments.  The 
results  from  three  different  video  sequences  are  presented  in  Figs.  7  and  8. 
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Mean  Angular  Error  (degree)  Standard  Deviation  of  Angular  Error  (degree) 


Variance  of  Gaussian  Noise  Variance  of  Gaussian  Noise 

Mean  Magnitude  Error  Standard  Deviation  Magnitude  Error 


Variance  of  Gaussian  Noise  Variance  of  Gaussian  Noise 

Fig.  5.  Angular  error  (first  row)  and  magnitude  error  ( second  row )  of  synthetic  data  with 
added  Gaussian  noise. 

The  sequences  were  captured  with  Videre  MEGA-D  system:  a  binocular  stereo 
camera  connected  with  Matrox  capture  card  through  fire  wire  cable.  The  frame 
rate  of  the  stereo  sequence  is  around  30  frames/sec  with  resolution  of  320  x  240. 
The  scene  flow  algorithm  is  implemented  Matlab  and  C++.  Experiments  were  con¬ 
ducted  on  an  AMD  Athlon  MP  2100+  machine.  Dense  scene  flow  is  computed 
for  each  frame  in  about  two  minutes  per  frame.  The  acquired  sequences  are  rec¬ 
tified  and  the  calibration  information  is  given.  The  binocular  video  sequences  are 
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(a) 


(b) 


(c) 


(d) 

Fig.  6.  Results  of  sample  frames  from  the  synthetic  data  (noise  level  =  5%).  (a)  surface 
deformation  estimated  by  the  groundtruth  3D  scene  flow;  (b)  surface  deformation  estimated 
by  Method  1  [51];  (c)  surface  deformation  estimated  by  Method  2;  (d)  surface  deformation 
estimated  by  Method  3. 

acquired  in  an  uncontrolled  illuminated  environment  as  seen  in  Fig.  7;  hence  the 
estimates  of  optical  flow  and  disparity  tend  to  be  noisy. 

In  Fig.  7,  the  observable  motions  in  the  first  sequence  are  the  backward  movement 
of  right  hand  and  the  forward  movement  of  left  hand.  The  second  row  of  Fig.  8 
shows  the  2D  projection  of  the  estimated  3D  flows  in  the  left  and  right  views,  the  Z 
velocities  and  the  variances.  From  the  result,  we  can  see  that  the  3D  movements  of 
the  left  and  right  hands  have  been  described  reliably.  In  the  second  sequence,  both 
hands  are  moving  forward.  The  projections  and  Z  motion  of  the  3D  flow  demon¬ 
strate  similar  reliability. 

To  show  the  ability  to  recover  non-rigid  motion,  we  captured  a  sequence  with  a 
deforming  sponge  (by  pushing  the  top  of  the  sponge)  with  painted  texture  patterns. 
Only  the  upper  part  of  sponge  has  obvious  Z  motion  (towards  the  camera)  and 
the  rest  of  the  sponge  only  moves  slightly  towards  the  table.  This  is  shown  clearly 
in  projected  flows  and  the  recovered  Z  motion.  The  variances  of  the  Z  velocities 
are  shown  in  the  last  column  of  images.  The  variances  give  information  about  the 
reliability  of  the  estimates.  Darker  areas  indicate  lower  variance  and  brighter  areas 
represent  higher  variance.  From  Fig.  7,  one  can  see  that  the  variance  is  tied  to 
the  2D  image  properties,  e.g.,  local  image  contrast  and  texture  information.  This 
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observation  again  verifies  the  promise  of  our  proposed  method.  One  thing  to  note  is 
that  there  is  a  bit  of  shadow  movement  being  captured  by  the  algorithm.  Exploiting 
strong  geometric  cues  ( e.g .,  the  desk  is  flat)  would  help  remove  this  artifact,  but  this 
is  not  the  focus  of  the  proposed  algorithm. 

In  the  sequences  we  captured,  the  largest  displacement  between  the  stereo  pairs 
is  around  100  pixels  and  there  are  occluded  areas  present  in  all  the  captured  se¬ 
quences.  However,  we  still  can  produce  reasonable  estimates  in  those  areas.  This  is 
due  to  the  region  merging  step  where  the  information  fill-in  takes  place. 

Comparisons  with  the  approach  of  [5 1]  are  shown  in  Fig.  8.  The  inputs  to  Eq.  13  are 
obtained  using  Algorithm  1,  but  the  covariances  are  not  used.  Without  ground  truth 
data,  it  is  difficult  to  quantitatively  evaluate  the  results.  Here  we  use  the  error  image 
of  the  warped  image  of  I (t)  and  the  target  image  I (f+1).  If  the  projected  flow  fields 
are  accurate,  we  should  have  smaller  warped  image  error.  In  all  three  test  sequences, 
the  proposed  method  produced  much  smaller  warped  image  errors  compared  to 
those  of  [5 1].  Qualitatively,  our  proposed  method  produced  much  cleaner  projected 
flow  and  Z-motion  in  all  three  test  sequences.  Similar  results  have  been  shown  in 
[60]  with  three  cameras  while  we  only  used  a  binocular  stereo  rig. 


5  Discussion  and  Conclusions 


A  multi-scale  integrated  algorithm  for  3D  scene  flow  computation  was  proposed.  A 
region-based  probabilistic  algorithm  was  introduced  to  compute  the  distributions  of 
optical  flow  and  disparity.  Covariances  and  variances  from  the  probabilistic  multi¬ 
scale  framework  for  optical  flow  and  disparity  computation  are  combined  to  esti¬ 
mate  3D  scene  flow.  Occlusions  due  to  large  displacement  are  handled  through  a 
region  merging  process  that  allows  occluded  regions  to  take  on  valid  estimates  from 
adjacent  regions.  Experiments  with  synthetic  and  real  data  demonstrate  much  better 
performance  with  just  two  cameras  compared  to  [51].  This  superior  performance  is 
obtained  via  taking  care  of  the  uncertainty  from  2D  computation  and  using  region 
information  to  regularize  the  2D  estimates.  At  the  same  time,  the  proposed  method 
computes  covariances  of  the  estimated  3D  scene  flow.  The  covariances  are  prop¬ 
agated  from  the  2D  image  data  and  hence  provide  a  measure  of  how  reliable  the 
estimated  scene  flow  is.  We  expect  that  the  covariances  should  provide  a  good  ini¬ 
tialization  for  related  3D  tracking  algorithms.  Our  proposed  approach  is  general  and 
can  be  used  in  a  multi-camera  setup  (E.g.  [51,  59,  60])  and  should  enable  improved 
estimation  of  3D  scene  flow.  One  way  to  extend  our  approach  in  a  multi-camera 
setup  is  to  choose  a  reference  camera  along  the  lines  of  [59,  60]. 

We  are  currently  investigating  how  to  incorporate  the  proposed  algorithm  in  track¬ 
ing  applications  such  as  vision-based  human-computer  interfaces.  Other  interesting 
applications  include  analyzing  and  annotating  events  in  stereo  video  through  anal- 
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ysis  of  3D  scene  flow.  Future  work  includes  extending  our  formulation  to  exploit 
available  prior  information,  e.g.,  the  shape  information  of  the  object,  to  eliminate 
the  inaccuracy  like  what  is  shown  in  Fig.  7,  where  the  shadow  affects  the  estimates. 
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Fig.  7.  Experimental  results  with  three  real  sequences.  The  motions  presented  in  the  first 
sequence  are  one  fist  moving  forward  and  the  other  backward.  The  second  sequence  shows 
the  motion  of  both  hands  moving  forward.  A  deforming  sponge  (by  pushing  the  top)  is 
shown  in  the  third  sequence.  In  the  rows  that  demonstrate  the  estimation  results,  the  first 
two  columns  are  the  projections  of  estimated  3D  scene  flow  in  left  and  right  view,  the  third 
column  is  the  Z  velocity  intensity  image,  the  darker  area  represents  the  hand  moving  away 
from  the  camera,  the  brighter  area  indicates  the  hand  moving  towards  the  camera,  the  last 
column  shows  the  variances  of  the  Z  velocity  where  the  darker  areas  represent  the  places 
where  the  estimates  for  Z  velocity  are  more  reliable  and  the  brighter  areas  represent  the 
places  where  the  Z  velocity  estimates  are  less  reliable. 


28 


Using  Eq.13  Our  method  Using  Eq. 13  Our  method  Using  Eq.13  Our  method 


projected  left  flow 


projected  right  flow 


z  flow 


warped  image  error  (left) 


Fig.  8.  Comparison  with  the  estimation  results  using  Eq.  13  with  the  three  video  sequences. 
The  first,  third  and  fifth  rows  show  the  results  of  using  Algorithm  2  while  the  second,  fourth 
and  sixth  rows  show  the  results  using  Eq.  13,  which  is  equivalent  to  [51  ]. 
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